In this lab, we’ll look at how to implement a range of tree-based methods, including basic classification and regression trees, random forests, and boosted regression trees. We will also look at how to predict for new data and how to automate hyperparameter tuning.
We’ll illustrate these methods by using them to build a species
distribution model for the pinyon pine in the western United States. You
will need the following datasets from Canvas: Pinus_edulis.csv, the
ne_50m_admin_0_countries shapefile, current.env.RData,
future.env.RData, and Sonar.csv. Download these to your
datafiles folder (extract any zip files). Make a new folder
for today’s class called lab08.
If you are using Python for today’s lab, you’ll need to download the
Jupyter notebook for this lab (GEOG_5160_6160_lab08.ipynb).
Make a new folder for the lab (lab08) and move the notebook
to this folder. Now open a new terminal (in Windows go to the [Start Menu] >
[Anaconda (64-bit)] > [Anaconda prompt]).
You will need to make sure the following packages are installed on your computer (in addition to the packages we have used in previous labs).
conda install xarray
conda install py-xgboost

Once opened, change directory to the folder you just made, and activate your conda environment:
conda activate geog5160
Start the Jupyter Notebook server:
jupyter notebook
And open the notebook for today’s class.
Start RStudio and set the working directory to this directory (This can be changed by going to the [Session] menu and selecting [Set working directory] -> [Choose directory…], then browsing to the folder).
You will need to make sure the libraries loaded in the code below are installed on your computer.
Before doing anything else, run the following command and make sure that you can see the files you downloaded.
list.files()
Next load the libraries you will need for the lab. You should at this
stage have most of these already installed. Add anything that is not
installed using the install.packages() function.
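If you’d like to check and install everything in one step, the following sketch works (the package list here simply mirrors the library() calls below):

## Install any packages from the list that are not already present
pkgs <- c("dplyr", "raster", "sf", "dismo", "tree", "mlr3",
          "mlr3learners", "mlr3tuning", "vip", "pdp", "RColorBrewer")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)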
library(dplyr)
library(raster)
library(sf)
library(dismo)
library(tree)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(vip)
library(pdp)
library(RColorBrewer)
Next, we’ll read the known locations of Pinus edulis (pinyon pine) trees from the file Pinus_edulis.csv, and plot them to check the data:
pe <- read.csv("../datafiles/Pinus_edulis.csv")
pe <- st_as_sf(pe, coords = c("longitude", "latitude"))
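Note that st_as_sf() does not assign a coordinate reference system here. If you want the points to carry one, you can set it explicitly (an optional step, assuming the coordinates are WGS84 longitude/latitude):

## Assign WGS84 geographic coordinates to the point data
st_crs(pe) <- 4326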
borders <- st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp")
## Reading layer `ne_50m_admin_0_countries' from data source
## `/Users/u0784726/Dropbox/Data/devtools/geog5160/datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 241 features and 94 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
## Geodetic CRS: WGS 84
plot(st_geometry(pe), xlim = c(-120,-80), ylim = c(25,55),
pch = 21, bg = "darkorange", axes=TRUE)
plot(st_geometry(borders), add = TRUE)
If you need to get your own species data, the dismo
package includes a function gbif() that allows you to
download records directly from the GBIF database. To demonstrate this, we’ll use
it here to get occurrence records of the timber rattlesnake Crotalus
horridus. This function allows selection by Latin name, so here we
specify the genus and species names separately (this will take a few
seconds to retrieve all the records):
rattler <- gbif('crotalus','horridus')
The returned object has much more information about the occurrence records, including metadata about the original study that supplied the data. Note that all records are returned irrespective of whether or not they have associated geolocations, so we subset only the records that have coordinates (the function does have an argument to exclude records with no longitude or latitude).
rattler <- subset(rattler, !is.na(rattler$lon))
rattler <- st_as_sf(rattler, coords = c("lon", "lat"))
Finally, we plot the records using the same code as before:
plot(st_geometry(rattler), pch = 21, bg = "green", axes = TRUE)
plot(st_geometry(borders), add = TRUE)
Note that there are some odd locations, including an observation in Turkey. Before using these data in modeling, we would want to verify and potentially remove these records.
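As a minimal sketch of such a cleaning step (the longitude cutoff of -30 degrees is an arbitrary assumption that simply keeps records in the Americas):

## Drop records east of 30 degrees W (removes the stray Old World points)
crds <- st_coordinates(rattler)
rattler <- rattler[crds[, "X"] < -30, ]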
There are a large number of available data sources for environmental
data that can be used in species distribution models. We’ll take data
from the WorldClim project (Hijmans
et al. 2005), a collection of standardized climate data at a variety of
spatial resolutions. These data can be downloaded directly using the
getData() function, which allows direct downloads from this
and other datasets. The dataset contains monthly averages of temperature
and precipitation, plus a set of bioclimatic variables, which represent
aggregate climate measures assumed to be linked to species
distributions. I’ve already downloaded the bioclimatic variables for you
and clipped them to the region you’re going to work in. These are
available in two RData files containing modern
(current.env.RData) and future (future.env.RData)
climates for the study area.
load("../datafiles/current.env.RData")
load("../datafiles/future.env.RData")
The data are in R’s raster format, stored as a stack of layers (effectively a 3D array). If you type the name of one of the loaded objects, you will see an overview of the data:
current.env
## class : RasterStack
## dimensions : 480, 720, 345600, 19 (nrow, ncol, ncell, nlayers)
## resolution : 0.04166667, 0.04166667 (x, y)
## extent : -130, -100, 30, 50 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names : bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9, bio10, bio11, bio12, bio13, bio14, bio15, ...
## min values : -7.1, 5.6, 2.4, 165.9, 6.0, -24.1, 11.8, -13.5, -15.3, 0.5, -15.9, 46.0, 7.0, 0.0, 5.0, ...
## max values : 23.9, 21.3, 6.5, 1299.2, 45.6, 10.0, 50.9, 33.2, 32.9, 36.0, 15.4, 3345.0, 556.0, 77.0, 123.0, ...
This tells us that the files are 480 x 720 pixels, with 19 layers, each
representing a single climate variable (the variables are listed in the
appendix of today’s lab). There’s also information about the extent and
resolution of the data. If you need to get different data, or data for
another region, look at the help for the getData()
function.
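For example, the global bioclimatic variables at 10 arc-minute resolution could be retrieved with something like the following (this triggers a download, so only run it if you need the data):

## Download the 19 WorldClim bioclimatic variables at 10 arc-minute resolution
wc <- getData("worldclim", var = "bio", res = 10)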
You can extract any single layer using the raster()
function (e.g. raster(current.env, 1) will give you the
first layer). We can use this to visualize the data:
plot(raster(current.env, 1))
plot(st_geometry(pe), add = TRUE, pch = 21, bg = "olivedrab")
The RColorBrewer package provides several color palettes for spatial data. If you have this installed, you can change the colors as follows:
my.pal <- brewer.pal(9, "YlOrRd")
plot(raster(current.env, 1), col = my.pal)
Finally, we make up a dataset for modeling. We want to model the
species distribution as a binary classification task, where 1
indicates presence and 0 indicates absence. However, we only have
observations of presences (nobody records when they don’t see a
species). To get around this, we’ll generate some ‘pseudo-absences’:
locations that represent an absence of this species. There are several
ways to generate these. One of the easiest is to simply pick random
locations within the study area using the randomPoints()
function. This takes as inputs a mask raster defining the study area, the
number of points to generate (n), and, optionally, the coordinates of the
presence points to avoid (p).
This generates a matrix of coordinates that we then need to convert
to a spatial sf object. This is a bit long-winded, but
allows us to easily plot the new points.
absence <- randomPoints(mask = raster(current.env, 1),
n = nrow(pe), p = st_coordinates(pe))
absence <- st_as_sf(as.data.frame(absence), coords = c("x", "y"))
Now we can compare the distribution of presences and absences:
plot(raster(current.env, 1))
plot(st_geometry(pe), pch = 21, bg = "olivedrab", add = TRUE)
plot(st_geometry(absence), pch = 21, bg = "white", add = TRUE)
One problem with this method is that we sample absences very far from
the observed points. As the environment at these points is very unlike
the places where the species is found, they will always predict as
absences, and can bias our assessment of the model. A better approach is
to restrict the absences to a small region around the observed points.
We could crop the environmental mask to a smaller region, or use a
buffer approach. We’ll do the second of these here. We first generate
200 km buffers around each observed presence using the
circles() function, then use the spsample()
function to pick random points within these buffers:
x <- circles(st_coordinates(pe), d = 200000, lonlat = TRUE)
absence <- as(spsample(x@polygons, type = 'random', n = nrow(pe)), "sf")
plot(raster(current.env,1))
plot(st_geometry(pe), pch = 21, bg = "olivedrab", add = TRUE)
plot(st_geometry(absence), pch = 21, bg = "white", add = TRUE)
Now we have the presence and absence points, we need to extract the environmental values at these locations (these will be the features for machine learning). To do this, we use the following steps:

- Combine the coordinates of the presence and absence points
- Use the extract() function to get the associated climate values from the environmental grids
- Make a vector of class labels (1 for presence, 0 for absence)
- Combine everything into a single data frame

pe.crds <- rbind(st_coordinates(pe), st_coordinates(absence))
pe.env <- raster::extract(current.env, pe.crds)
pe.pa <- as.factor(c(rep(1, nrow(pe)), rep(0, nrow(pe))))
pe.df <- data.frame(pe.crds, pa = pe.pa, pe.env)
row.names(pe.df) <- 1:nrow(pe.df)
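Before moving on, it’s worth a quick check that the data frame looks sensible and that the two classes are balanced:

head(pe.df)
table(pe.df$pa)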
And having done all that, we can now move on to building our models.
Classification and Regression Trees (CART) is a non-linear, non-parametric modeling approach that can be used with a wide variety of data. Regression trees are used with continuous outcome data, and classification trees with binary or categorical data, but the interface for these is the same in R.
R has a number of packages for performing CART and associated analyses. We’re going to use mlr3, which implements these methods through the rpart package. While the syntax differs a little between packages, there is little difference in the way they work.
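As a quick standalone illustration, here is a minimal sketch that fits a classification tree directly with rpart, outside of mlr3 (the two bioclimatic features used here are an arbitrary choice from the pe.df data frame built above):

## Fit a simple classification tree for presence/absence using two features
library(rpart)
ct <- rpart(pa ~ bio1 + bio12, data = pe.df, method = "class")
print(ct)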
We’ll build a classification model for the Pinus edulis data
set using mlr3. CART methods are implemented in
mlr3 using the rpart package and the
two learners are classif.rpart for a classification tree
and regr.rpart for a regression tree.
As a reminder, we need to identify the task, define a performance measure and a resampling strategy, and then train the learner and evaluate the outcomes:
task_pe <- TaskClassif$new(id = "pe", backend = pe.df,
target = "pa")
## Check the task details
task_pe$col_roles
## $feature
## [1] "X" "Y" "bio1" "bio10" "bio11" "bio12" "bio13" "bio14" "bio15"
## [10] "bio16" "bio17" "bio18" "bio19" "bio2" "bio3" "bio4" "bio5" "bio6"
## [19] "bio7" "bio8" "bio9"
##
## $target
## [1] "pa"
##
## $name
## character(0)
##
## $order
## character(0)
##
## $stratum
## character(0)
##
## $group
## character(0)
##
## $weight
## character(0)
The task is currently using all features, including the coordinates
(X and Y). We can exclude these from the
feature list:
task_pe$col_roles$feature <- setdiff(task_pe$col_roles$feature,
c("X", "Y"))
task_pe$feature_names
## [1] "bio1" "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17"
## [10] "bio18" "bio19" "bio2" "bio3" "bio4" "bio5" "bio6" "bio7" "bio8"
## [19] "bio9"
measure = msr("classif.auc")
lrn_ct = lrn("classif.rpart", predict_type = "prob")
resamp_hout = rsmp("holdout", ratio = 0.8)
resamp_hout$instantiate(task_pe)
We then run the resampler and examine the performance:
rr = resample(task_pe, lrn_ct, resamp_hout, store_models = TRUE)
rr$score(measure)
## task task_id learner learner_id
## 1: <TaskClassif[50]> pe <LearnerClassifRpart[38]> classif.rpart
## resampling resampling_id iteration prediction
## 1: <ResamplingHoldout[20]> holdout 1 <PredictionClassif[20]>
## classif.auc
## 1: 0.7962489
rr$aggregate(measure)
## classif.auc
## 0.7962489
With the default settings, this has worked fairly well, but we will
next try to improve on this by tuning the hyperparameters. Before doing
so, we can extract any of the trees that were built during the
resampling. These are stored in a list in rr$learners, and
you can select an individual model using R’s list index
[[i]]. As we are using a hold-out, there is only one
learner in this list ([[1]]). With a 5-fold
cross-validation, there would be five ([[1]]...[[5]]).
rr$learners[[1]]$model
## n= 478
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 478 230 1 (0.48117155 0.51882845)
## 2) bio17< 31.5 108 12 0 (0.88888889 0.11111111) *
## 3) bio17>=31.5 370 134 1 (0.36216216 0.63783784)
## 6) bio1< 4.75 47 6 0 (0.87234043 0.12765957) *
## 7) bio1>=4.75 323 93 1 (0.28792570 0.71207430)
## 14) bio5>=32.35 75 31 0 (0.58666667 0.41333333)
## 28) bio17< 46.5 41 7 0 (0.82926829 0.17073171) *
## 29) bio17>=46.5 34 10 1 (0.29411765 0.70588235) *
## 15) bio5< 32.35 248 49 1 (0.19758065 0.80241935)
## 30) bio12< 371 85 32 1 (0.37647059 0.62352941)
## 60) bio10< 21.75 75 32 1 (0.42666667 0.57333333)
## 120) bio8< 10.55 14 3 0 (0.78571429 0.21428571) *
## 121) bio8>=10.55 61 21 1 (0.34426230 0.65573770) *
## 61) bio10>=21.75 10 0 1 (0.00000000 1.00000000) *
## 31) bio12>=371 163 17 1 (0.10429448 0.89570552)
## 62) bio13< 50.5 20 8 1 (0.40000000 0.60000000)
## 124) bio2< 16 7 1 0 (0.85714286 0.14285714) *
## 125) bio2>=16 13 2 1 (0.15384615 0.84615385) *
## 63) bio13>=50.5 143 9 1 (0.06293706 0.93706294) *
We’ll use the rpart.plot package to visualize the
first of these. This package allows you to make a large number of tweaks to a
tree plot. Here, extra = 106 adds the following to each
terminal node: the predicted value, the proportion of 1s,
and the percentage of observations in that node:
library(rpart.plot)
prp(rr$learners[[1]]$model, extra = 106, roundint = FALSE)
Classification and regression trees have a large number of hyperparameters and benefit from tuning. We can do this using the mlr3tuning package. There are several steps here: defining a parameter search space, setting a stopping criterion (a terminator), choosing a search strategy (a tuner), and running the tuning. For the task, measure, and resampling, we’ll use the definitions from the previous section.
Parameter sets can be generated using the paradox package. This should have been installed along with mlr3, so load this now.
library(paradox)
Next, check the available parameters for our learner:
lrn_ct$param_set
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0 1 Inf 0.01
## 2: keep_model ParamLgl NA NA 2 FALSE
## 3: maxcompete ParamInt 0 Inf Inf 4
## 4: maxdepth ParamInt 1 30 30 30
## 5: maxsurrogate ParamInt 0 Inf Inf 5
## 6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: minsplit ParamInt 1 Inf Inf 20
## 8: surrogatestyle ParamInt 0 1 2 0
## 9: usesurrogate ParamInt 0 2 3 2
## 10: xval ParamInt 0 Inf Inf 10 0
This table also gives the type of each parameter (e.g. double precision or
integer), the lower and upper bounds, and the default value. We’ll test
values for the complexity parameter (cp) between 0.0001 and
0.1, and for the minimum number of observations required to consider
splitting a node (minsplit) from 2 to 20.
## Build parameter set
tune_ps = ParamSet$new(list(
ParamDbl$new("cp", lower = 0.0001, upper = 0.1),
ParamInt$new("minsplit", lower = 2, upper = 20)
))
tune_ps
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 1e-04 0.1 Inf <NoDefault[3]>
## 2: minsplit ParamInt 2e+00 20.0 19 <NoDefault[3]>
Next, we define one or more stopping criteria for the tuning. This is largely to ensure that tuning highly complex algorithms remains manageable. The available stopping criteria include a fixed run time, a target performance level, and a fixed number of evaluations. We’ll use the number of evaluations here: the function to set the terminator is
trm(), and we set the number of evaluations to 50.
evals = trm("evals", n_evals = 50)
Now, we set up a sampling strategy for searching among different
hyperparameter values. There are a couple of options here; we will use a
grid search, where the argument resolution gives the number
of steps between the lower and upper bounds defined in our
ParamSet.
tuner = tnr("grid_search", resolution = 10)
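Other search strategies can be set up in the same way; for example, a random search, which samples parameter combinations at random rather than stepping over a grid (a sketch, not used below):

tuner_rs = tnr("random_search")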
The mlr3tuning package offers a couple of ways to
tune: either by first running the tuning and then using the selected
parameters to train the final model, or by combining these steps using
AutoTuner(). We’ll use the second of these here - it’s a
little neater, and has one other advantage, as we will see later.
First, create a new AutoTuner using the various other
functions and parameters that we have just defined:
at_ct = AutoTuner$new(learner = lrn_ct,
resampling = rsmp("holdout"),
measure = measure,
search_space = tune_ps,
terminator = evals,
tuner = tuner)
Note that we use a holdout method in the AutoTuner to split the data.
This will be used to assess how the model’s skill changes as we vary the
parameters. Each time it will evaluate the measure on the holdout test
set. Whichever parameter set gives the best performance will then be
automatically used to train a final model. The AutoTuner
object inherits from the Learner methods we have previously
seen, so to tune and train the model, just type:
at_ct$train(task_pe)
We can then see the results in at_ct$learner, showing a
value of cp of 0.0334 and minsplit of 16.
at_ct$learner
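If you want to see the best configuration and everything that was evaluated along the way, both are stored on the AutoTuner (a sketch; the archive is held as a data.table, so we load that package to convert it):

library(data.table)
at_ct$tuning_result            ## the best parameter configuration found
as.data.table(at_ct$archive)   ## all evaluated configurations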
The previous code has trained a model, but it has not evaluated it.
While this used a resampling strategy (the holdout), this is only used
to select the best value of the hyperparameters. To evaluate the final
trained model, we need to use an independent dataset. Fortunately, this
is quite easy to set up using the AutoTuner.
To understand the following code, we need to distinguish between the inner and the outer resampling strategy. The inner strategy is used within the AutoTuner to compare hyperparameter values; the outer strategy is used to evaluate the final tuned model on data that played no part in the tuning.
Here, we’ll use a hold-out for the inner strategy, and a 3-fold cross-validation for the outer.
resampling_inner = rsmp("holdout", ratio = 0.8)
resampling_outer = rsmp("cv", folds = 3)
Next we’ll remake the AutoTuner with the inner
strategy:
at_ct = AutoTuner$new(learner = lrn_ct,
resampling = resampling_inner,
measure = measure,
search_space = tune_ps,
terminator = evals,
tuner = tuner)
And now run resample using the AutoTuner as
the learner, and the outer resampling strategy (this will take a minute
or so to run):
rr_ct = resample(task = task_pe, learner = at_ct,
resampling = resampling_outer, store_models = TRUE)
This will take a little while to run; remember that this is dividing the original data set up three times, then for each one tuning the parameters across the parameter set/resolution. You can see the selected set of parameters for each of the outer cross-validation folds as follows:
sapply(rr_ct$learners, function(x) x$param_set$values)
## [[1]]
## named list()
##
## [[2]]
## named list()
##
## [[3]]
## named list()
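If this returns empty lists, as it does here, the tuned values are stored on each AutoTuner itself and can usually be retrieved from the tuning result instead (a sketch; the exact slot can vary with the mlr3tuning version):

lapply(rr_ct$learners, function(x) x$tuning_result)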
rr_ct$score(measure)
## task task_id learner learner_id
## 1: <TaskClassif[50]> pe <AutoTuner[46]> classif.rpart.tuned
## 2: <TaskClassif[50]> pe <AutoTuner[46]> classif.rpart.tuned
## 3: <TaskClassif[50]> pe <AutoTuner[46]> classif.rpart.tuned
## resampling resampling_id iteration prediction
## 1: <ResamplingCV[20]> cv 1 <PredictionClassif[20]>
## 2: <ResamplingCV[20]> cv 2 <PredictionClassif[20]>
## 3: <ResamplingCV[20]> cv 3 <PredictionClassif[20]>
## classif.auc
## 1: 0.8714646
## 2: 0.7682186
## 3: 0.8323202
rr_ct$aggregate(measure)
## classif.auc
## 0.8240011
We get a small improvement here, but not much, and in general,
rpart doesn’t improve a lot with tuning. We’ll turn now to
some more complex methods, to see if we can improve on these
results.
Next, we’ll try to build a random forest for the Pinus
edulis data. The mlr3 package uses the random forest
routine from the ranger package, which can run on
multiple cores. To build this, all we need to do is create a new learner
(classif.ranger), and run the hold-out resampling on
it:
lrn_rf = lrn("classif.ranger",
predict_type = "prob",
importance = "permutation")
rr = resample(task_pe, lrn_rf, resamp_hout, store_models = TRUE)
rr$score(measure)
## task task_id learner learner_id
## 1: <TaskClassif[50]> pe <LearnerClassifRanger[38]> classif.ranger
## resampling resampling_id iteration prediction
## 1: <ResamplingHoldout[20]> holdout 1 <PredictionClassif[20]>
## classif.auc
## 1: 0.8752487
rr$aggregate(measure)
## classif.auc
## 0.8752487
The default random forest shows an improvement over the CART model. We’ll next try to tune this to see if we can improve its performance. First, let’s take a look at the set of available hyperparameters:
lrn_rf$param_set
## <ParamSet>
## id class lower upper nlevels default
## 1: alpha ParamDbl -Inf Inf Inf 0.5
## 2: always.split.variables ParamUty NA NA Inf <NoDefault[3]>
## 3: class.weights ParamUty NA NA Inf
## 4: holdout ParamLgl NA NA 2 FALSE
## 5: importance ParamFct NA NA 4 <NoDefault[3]>
## 6: keep.inbag ParamLgl NA NA 2 FALSE
## 7: max.depth ParamInt 0 Inf Inf
## 8: min.node.size ParamInt 1 Inf Inf
## 9: min.prop ParamDbl -Inf Inf Inf 0.1
## 10: minprop ParamDbl -Inf Inf Inf 0.1
## 11: mtry ParamInt 1 Inf Inf <NoDefault[3]>
## 12: mtry.ratio ParamDbl 0 1 Inf <NoDefault[3]>
## 13: num.random.splits ParamInt 1 Inf Inf 1
## 14: num.threads ParamInt 1 Inf Inf 1
## 15: num.trees ParamInt 1 Inf Inf 500
## 16: oob.error ParamLgl NA NA 2 TRUE
## 17: regularization.factor ParamUty NA NA Inf 1
## 18: regularization.usedepth ParamLgl NA NA 2 FALSE
## 19: replace ParamLgl NA NA 2 TRUE
## 20: respect.unordered.factors ParamFct NA NA 3 ignore
## 21: sample.fraction ParamDbl 0 1 Inf <NoDefault[3]>
## 22: save.memory ParamLgl NA NA 2 FALSE
## 23: scale.permutation.importance ParamLgl NA NA 2 FALSE
## 24: se.method ParamFct NA NA 2 infjack
## 25: seed ParamInt -Inf Inf Inf
## 26: split.select.weights ParamUty NA NA Inf
## 27: splitrule ParamFct NA NA 3 gini
## 28: verbose ParamLgl NA NA 2 TRUE
## 29: write.forest ParamLgl NA NA 2 TRUE
## id class lower upper nlevels default
## parents value
## 1:
## 2:
## 3:
## 4:
## 5: permutation
## 6:
## 7:
## 8:
## 9:
## 10:
## 11:
## 12:
## 13: splitrule
## 14: 1
## 15:
## 16:
## 17:
## 18:
## 19:
## 20:
## 21:
## 22:
## 23: importance
## 24:
## 25:
## 26:
## 27:
## 28:
## 29:
## parents value
The two most frequently tuned parameters are mtry (the
number of variables considered at each split) and num.trees (the
number of trees built). Next, we define a parameter set for these:
## Build parameter set
tune_ps = ParamSet$new(list(
ParamInt$new("mtry", lower = 1, upper = 8),
ParamInt$new("num.trees", lower = 100, upper = 1000)
))
Then, we define a new AutoTuner. This is simply a copy
of the previous one, but with an updated learner and parameter set. Note
that we again set the inner resampling strategy.
at_rf = AutoTuner$new(learner = lrn_rf,
resampling = resampling_inner,
measure = measure,
search_space = tune_ps,
terminator = evals,
tuner = tuner)
Again, we use resample to tune the model within the
outer resampling strategy:
rr_rf = resample(task = task_pe, learner = at_rf,
resampling = resampling_outer, store_models = TRUE)
This will take a little while to run; remember that this is dividing the original data set up three times, then for each one tuning the parameters across the parameter set/resolution.
rr_rf$score(measure)
## task task_id learner learner_id
## 1: <TaskClassif[50]> pe <AutoTuner[46]> classif.ranger.tuned
## 2: <TaskClassif[50]> pe <AutoTuner[46]> classif.ranger.tuned
## 3: <TaskClassif[50]> pe <AutoTuner[46]> classif.ranger.tuned
## resampling resampling_id iteration prediction
## 1: <ResamplingCV[20]> cv 1 <PredictionClassif[20]>
## 2: <ResamplingCV[20]> cv 2 <PredictionClassif[20]>
## 3: <ResamplingCV[20]> cv 3 <PredictionClassif[20]>
## classif.auc
## 1: 0.8722826
## 2: 0.8979446
## 3: 0.9221212
rr_rf$aggregate(measure)
## classif.auc
## 0.8974495
And again, we see a small jump in the AUC with the newly tuned model. To see the selected parameter values:
sapply(rr_rf$learners, function(x) x$param_set$values)
## [[1]]
## named list()
##
## [[2]]
## named list()
##
## [[3]]
## named list()
Next, we’ll plot the permutation-based variable importance for this
model. As a reminder, variable importance is a measure of how much worse
a model becomes when we scramble the values of one of the features. The
model is used to predict the outcome for some test data (here the
out-of-bag samples) twice: once with the original values of the feature
and once with randomly shuffled values. If there is a large difference
in the skill of the model, this feature is important in controlling the
outcome. We’ll use the vip() function from the
vip package to show and then plot the variable importance
scores. As there are three possible models from the resampling, we’ll
just plot the first. The model object is buried quite deep in the
resampling output, in
rr_rf$learners[[1]]$model$learner$model:
vip(rr_rf$learners[[1]]$model$learner$model)
This indicates that bio7 (the annual temperature
range) is most important in controlling the distribution of the
pine.
We can look at the form of the relationship between the occurrence of
the pine and this feature (or any other one) using a partial dependence
plot. This shows changes in the outcome across the range of some feature
(with all other features held constant). Here, we’ll use the
partial() function from the pdp
package to produce the plot. As arguments, this requires the model, the
feature that you want the dependence on, and the set of data used to produce
the model. For a classification model, we can also specify which class
to plot for. In these data, the presences (1) are the second class.
partial(rr_rf$learners[[1]]$model$learner$model,
pred.var = "bio7", prob = TRUE,
train = pe.df, plot = TRUE, which.class = 2)
The plot shows the range over which this species is found, with a clear peak in suitability at about 36-37 degrees.
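The same call works for any other feature by changing pred.var; for example, for bio17 (the precipitation of the driest quarter in the standard WorldClim set):

partial(rr_rf$learners[[1]]$model$learner$model,
        pred.var = "bio17", prob = TRUE,
        train = pe.df, plot = TRUE, which.class = 2)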
We will now build a boosted regression tree model for the Pinus data. In contrast to random forests, which build a set of individual weak trees, boosted regression trees (BRTs) start with a single weak tree and iteratively improve on it. This is done by targeting the residuals from the previous set of trees and trying to model these in the next tree. While this can make these methods very powerful, it is easy for them to overfit the data, and hyperparameter tuning becomes very important here.
We’ll follow the same steps as with the random forest.
mlr3 uses the xgboost package, so we
define our learner as classif.xgboost. The only
hyperparameter that we will set here is subsample. This
runs a stochastic boosting where each individual tree is built with a
random subset of 50% of the training data.
lrn_brt = lrn("classif.xgboost",
predict_type = "prob",
subsample = 0.5)
rr = resample(task_pe, lrn_brt, resamp_hout, store_models = TRUE)
rr$score(measure)
rr$aggregate(measure)
With the default settings, our BRT performs much worse than the random forest, and about the same as the initial CART model. We’ll next try to tune it to see if we can improve its performance. First, let’s take a look at the set of available hyperparameters:
lrn_brt$param_set
There is a long list of potential parameters to tune here. We’ll
focus on three: eta (the learning rate),
max_depth (the maximum depth of each tree) and
nrounds (the number of boosting iterations). Next, we define
a parameter set for these:
tune_ps = ParamSet$new(list(
ParamDbl$new("eta", lower = 0.001, upper = 0.1),
ParamInt$new("max_depth", lower = 1, upper = 6), ## Raise upper limit
ParamInt$new("nrounds", lower = 100, upper = 1000)
))
Then, we define a new AutoTuner. This is simply a copy
of the previous one, but with an updated learner and parameter set. Note
that we again set the inner resampling strategy.
at_brt = AutoTuner$new(learner = lrn_brt,
resampling = resampling_inner,
measure = measure,
search_space = tune_ps,
terminator = evals,
tuner = tuner)
Again, we use resample to tune the model within the
outer resampling strategy. This may take a few minutes to run - remember
we are dividing the data into 3 folds, then running 50 tuning iterations
for each one.
rr_brt = resample(task = task_pe, learner = at_brt,
resampling = resampling_outer, store_models = TRUE)
sapply(rr_brt$learners, function(x) x$param_set$values)
You should see a much more substantial improvement in the AUC here.
rr_brt$score(measure)
## task task_id learner learner_id
## 1: <TaskClassif[50]> pe <AutoTuner[46]> classif.xgboost.tuned
## 2: <TaskClassif[50]> pe <AutoTuner[46]> classif.xgboost.tuned
## 3: <TaskClassif[50]> pe <AutoTuner[46]> classif.xgboost.tuned
## resampling resampling_id iteration prediction
## 1: <ResamplingCV[20]> cv 1 <PredictionClassif[20]>
## 2: <ResamplingCV[20]> cv 2 <PredictionClassif[20]>
## 3: <ResamplingCV[20]> cv 3 <PredictionClassif[20]>
## classif.auc
## 1: 0.8892915
## 2: 0.8970618
## 3: 0.8839195
rr_brt$aggregate(measure)
## classif.auc
## 0.8900909
As before, we can use a variable importance plot to look at the contribution of the individual features to the model:
vip(rr_brt$learners[[1]]$model$learner$model)
In this last section, we will produce predictive maps of suitability for the pine for the modern and future periods. The first step will be to make a final model based on the tuned hyperparameters and the full dataset. We’ll use the random forest here as an example. As a reminder, the results of the tuning were:
sapply(rr_rf$learners, function(x) x$param_set$values)
## [[1]]
## named list()
##
## [[2]]
## named list()
##
## [[3]]
## named list()
We’ll use the most common values of 100 trees and
mtry = 1 for our final model and train it on the full
dataset.
final_rf = lrn("classif.ranger",
predict_type = "prob",
importance = "permutation",
mtry = 1,
num.trees = 100)
final_rf$train(task_pe)
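As a quick check on this final model, printing the underlying ranger object shows, among other things, the out-of-bag prediction error:

final_rf$model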
For prediction, we now need to extract the environmental data for every grid cell in our region into a dataframe. We also get a list of grid cell coordinates (this will help with plotting the results).
current.env.df <- as.data.frame(getValues(current.env))
grid.crds <- coordinates(current.env)
Next, we remove any grid cell with missing values (those over the oceans), and the associated coordinates. Finally here we make a list of the grid cell number or index. We’ll use this to make maps of the predictions.
naID <- which(is.na(current.env.df$bio1))
current.env.df <- current.env.df[-naID,]
grid.crds <- grid.crds[-naID,]
grid.idx <- cellFromXY(current.env, grid.crds)
We can now use this dataframe with the predict_newdata
method to estimate the suitability.
pe.curr.pred <- final_rf$predict_newdata(current.env.df)
And finally, we use the grid cell indices to make a new raster layer showing the predictions. We do this by first making a template raster as a copy of one of the environmental layers, then using the grid cell index to set values to the probability of suitability. Finally, we plot this and overlay the original points:
pred.pal = brewer.pal(9, "Reds")
pe.curr.prob.r <- raster(current.env,1)
pe.curr.prob.r[grid.idx] <- pe.curr.pred$prob[,2]
plot(pe.curr.prob.r, col = pred.pal)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)
This shows that the model captures the current distribution well, but also predicts a large area of suitability in the north-west. We can also plot the predicted class (0/1). This requires converting it from R’s factor class to a numeric value:
pe.curr.class.r = raster(current.env,1)
tmp.pred <- as.numeric(pe.curr.pred$response) - 1
pe.curr.class.r[grid.idx] <- tmp.pred
plot(pe.curr.class.r)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)
Now let’s do the same with the future environment. First extract the values and coordinates.
future.env.df = as.data.frame(getValues(future.env))
grid.crds = coordinates(future.env)
names(future.env.df) <- names(current.env.df)
Now remove the missing values and set the grid index
naID = which(is.na(future.env.df$bio1))
future.env.df = future.env.df[-naID,]
grid.crds = grid.crds[-naID,]
grid.idx = cellFromXY(future.env, grid.crds)
Predict the suitability:
pe.rcp85.pred = final_rf$predict_newdata(future.env.df)
And plot it:
pe.rcp85.prob.r <- raster(future.env,1)
pe.rcp85.prob.r[grid.idx] <- pe.rcp85.pred$prob[,2]
plot(pe.rcp85.prob.r, col = pred.pal)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)
You should see here that the suitability is reduced in the south, but increased in the north with warming temperatures. This is shown more clearly by looking at the predicted classes:
pe.rcp85.class.r <- raster(future.env,1)
tmp.pred <- as.numeric(pe.rcp85.pred$response) - 1
pe.rcp85.class.r[grid.idx] <- tmp.pred
plot(pe.rcp85.class.r)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)
And finally, we can illustrate the change in range:
my.pal <- brewer.pal(3,'PRGn')
plot(pe.rcp85.class.r - pe.curr.class.r, col=my.pal)
The file Sonar.csv contains values of 208 sonar signals. The
data have 60 features, each representing “the energy within a particular
frequency band, integrated over a certain period of time”, and an
outcome variable Class, which is coded M or
R. The goal of the experiment is to discriminate between
sonar signals bounced off a rock (R) and those bounced off a metal object (a
mine, M). The exercise for this lab is to use one of the
ensemble methods (random forest or boosted regression trees) to
produce a new model of these data. You should use the mlr3
package, and you need to do the following:
- Use an AutoTuner to tune your model

In addition, your answer should include your R code.
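A possible starting point (a sketch; this assumes Sonar.csv is in your datafiles folder, and that Class needs converting to a factor for the classification task):

sonar <- read.csv("../datafiles/Sonar.csv")
sonar$Class <- as.factor(sonar$Class)
task_sonar <- TaskClassif$new(id = "sonar", backend = sonar, target = "Class")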